With 162 games played for 30 teams a year, anything can happen. Literally anything. Baseball is also interesting, especially for data scientists, because more than any other sport, much of the sport is based of off statistics. Don’t get it wrong, other sports also rely on similar statistics to develop the right stratagy. Doug Pederson uses statistics throughout every facet of how he runs the Philadelphia Eagles, especially on fourth downs, and it paid of big, leading to the Eagles first Superbowl win and led to an epic celebration.
Statistics, while important beforehand, got thrust to the forefront during the Moneyball Era. For a brief moment in time, small market franchises, teams with less money to spend on recruiting players, such as the Oakland Athletics were able to gain a slight edge over big market teams, teams with a lot of money to spend recruiting players. Now that big market teams have caught on, how does a small market team, like the Houston Astros, beat a big market team, like the Los Angeles Dodgers in game 7 of the World Series? The answer: youth.
Young players are inexpensive, but a bit risky. They could turn out to be the next Mike Trout, or the next Mark Appel. Most importantly, they have six years of team control when they first hit the major leagues, meaning that only the team the player is on can sign them, unless, of course, they don’t want to. This has lead teams to trading their older, better players for young talent, rebuilding their farm system with potential stars. When those stars hit The Show and a team thinks they have the talent to make a World Series push, they will trade for other star players, often from teams looking to rebuild, to fill the gaps in talent that they have, sacrificing some of the youth in their minor leagues.
This study will follow five teams that have followed this model more or less to see how effective this stratagy is. The study will show how R can be used to easily manipulate data, removing less important stats and adding in more important ones, graph data, and make theories. The teams that this study will follow are the Chicago Cubs, the Houston Astros, the Philadelphia Phillies, the New York Yankees, and the Atlanta Braves. The Cubs and Astros were chosen as both ended long championship droughts with a young team. The Phillies went on to win the World Series in 2008, but crashed in 2011 and have been rebuilding since. The Braves looked poised to start making a run in the early 2010s, but crashed hard. Both the Phillies and Braves are coming out of the rebuilding and so far in 2018, both have looked successful. Yankees have always seemed good, but went through a brief rebuild in 2016. It was shorter than most others as they were already competitive, making the playoffs in 2017.
There are two main sources of baseball data available online: and . This study will take from baseball-reference. Data will be taken from baseball-reference to show how data can be important via CSV files and via the internet. Ten years of data will be taken from 2008 to 2008 since the 2018 season has just begun. The 2018 year will prove to be an interesting addition as every team, with maybe the exception of the Cubs, are off to great starts. Even then, Cubs are doing well, but not as well as one might expect.
There are a number of ways to import datasets into R, whether it by from an SQL database, CSV, or from an html webpage. Baseball-reference allows you to easily download CSVs, which you can put into your working directory. In the following example, we’ve taken the batting and pitching general statistics CSVs from the 2008 Chicago Cubs. The files were downloaded as xls files and converted to csv files using Excel. Using read_csv from the tidyverse makes this really easy.
chc_08_bat <- read_csv("sportsref_chc_08_bat.xls.csv")
chc_08_pitch <- read_csv("sportsref_chc_08_pitch.xls.csv")
head(chc_08_pitch,10)
Easy, right? One command, one dataframe. Parsing from a website is a bit trickier as you have to locate where in the source html the table you want is stored. However, it does come with the benefit of not needing to download extra files. If we were to get all our data from downloaded files, we would have 100 CSVs. 10 years and five teams with two tables per adds up quickly. The following example shows how we can extract those tables directly from baseball-reference and into our program using the 2008 Houston Astros.
url_stros_08 <- "https://www.baseball-reference.com/teams/HOU/2008.shtml"
hou_08_bat <- url_stros_08 %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_08_pitch <- url_stros_08 %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
head(hou_08_bat,10)
## Rk Pos Name Age G PA AB R H 2B 3B HR RBI SB CS BB SO
## 1 1 C Brad Ausmus 39 81 250 216 15 47 8 0 3 24 0 2 25 41
## 2 2 1B Lance Berkman# 32 159 665 554 114 173 46 4 29 106 18 4 99 108
## 3 3 2B Kazuo Matsui# 32 96 422 375 58 110 26 3 6 33 20 5 37 53
## 4 4 SS Miguel Tejada 34 158 666 632 92 179 38 3 13 66 7 7 24 72
## 5 5 3B Ty Wigginton 30 111 429 386 50 110 22 1 23 58 4 6 32 69
## 6 6 LF Carlos Lee 32 115 481 436 61 137 27 0 28 100 4 1 37 49
## 7 7 CF Michael Bourn* 25 138 514 467 57 107 10 4 5 29 41 10 37 111
## 8 8 RF Hunter Pence 25 157 642 595 78 160 34 4 25 83 11 10 40 124
## 9 Rk Pos Name Age G PA AB R H 2B 3B HR RBI SB CS BB SO
## 10 9 3B Geoff Blum# 35 114 356 325 36 78 14 1 14 53 1 2 21 54
## BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
## 1 .218 .303 .296 .600 61 64 4 2 6 1 3
## 2 .312 .420 .567 .986 160 314 13 7 0 5 18
## 3 .293 .354 .427 .781 107 160 3 0 7 3 0
## 4 .283 .314 .415 .729 92 262 32 6 1 3 4
## 5 .285 .350 .526 .876 129 203 9 8 0 3 1
## 6 .314 .368 .569 .937 145 248 8 3 0 5 7
## 7 .229 .288 .300 .588 57 140 3 2 7 1 0
## 8 .269 .318 .466 .783 105 277 14 4 0 3 2
## 9 BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB
## 10 .240 .287 .418 .705 85 136 5 3 0 7 2
Using the rvest package, we see that importing data in this way is pretty simple too. The hardest part is looking through the html source code to find where the table is located. Fortunately, baseball-reference has made this easy for us. Sometimes, websites aren’t very good at stroing this tables, like this example which has the table hardcoded in as plaintext. In those cases, parsing the data gets a little more tricky, and regex comes into handy. Regex parsing will be covered in the next section.
One other thing to note in this example is the %>%. This is called pipe operator and it allows us to call multiple functions on a single dataset at once. We will use this many times when working with our data.
The following code obotains the rest of the data we need from the remaining pages, split up by team
url_stros <- "https://www.baseball-reference.com/teams/HOU/2009.shtml"
hou_09_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_09_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2010.shtml"
hou_10_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_10_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2011.shtml"
hou_11_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_11_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2012.shtml"
hou_12_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_12_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2013.shtml"
hou_13_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_13_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2014.shtml"
hou_14_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_14_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2015.shtml"
hou_15_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_15_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2016.shtml"
hou_16_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_16_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_stros <- "https://www.baseball-reference.com/teams/HOU/2017.shtml"
hou_17_bat <- url_stros %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
hou_17_pitch <- url_stros %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
Chicago Cubs:
url_chc <- "https://www.baseball-reference.com/teams/CHC/2009.shtml"
chc_09_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_09_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2010.shtml"
chc_10_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_10_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2011.shtml"
chc_11_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_11_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2012.shtml"
chc_12_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_12_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2013.shtml"
chc_13_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_13_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2014.shtml"
chc_14_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_14_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2015.shtml"
chc_15_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_15_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2016.shtml"
chc_16_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_16_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_chc <- "https://www.baseball-reference.com/teams/CHC/2017.shtml"
chc_17_bat <- url_chc %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
chc_17_pitch <- url_chc %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
Philadelphia Phillies
url_phi <- "https://www.baseball-reference.com/teams/PHI/2008.shtml"
phi_08_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_08_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2009.shtml"
phi_09_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_09_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2010.shtml"
phi_10_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_10_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2011.shtml"
phi_11_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_11_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2012.shtml"
phi_12_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_12_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2013.shtml"
phi_13_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_13_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2014.shtml"
phi_14_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_14_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2015.shtml"
phi_15_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_15_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2016.shtml"
phi_16_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_16_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_phi <- "https://www.baseball-reference.com/teams/PHI/2017.shtml"
phi_17_bat <- url_phi %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
phi_17_pitch <- url_phi %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
New York Yankees
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2008.shtml"
nyy_08_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_08_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2009.shtml"
nyy_09_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_09_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2010.shtml"
nyy_10_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_10_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2011.shtml"
nyy_11_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_11_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2012.shtml"
nyy_12_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_12_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2013.shtml"
nyy_13_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_13_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2014.shtml"
nyy_14_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_14_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2015.shtml"
nyy_15_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_15_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2016.shtml"
nyy_16_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_16_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_nyy <- "https://www.baseball-reference.com/teams/NYY/2017.shtml"
nyy_17_bat <- url_nyy %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
nyy_17_pitch <- url_nyy %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
Atlanta Braves
url_atl <- "https://www.baseball-reference.com/teams/ATL/2008.shtml"
atl_08_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_08_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2009.shtml"
atl_09_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_09_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2010.shtml"
atl_10_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_10_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2011.shtml"
atl_11_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_11_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2012.shtml"
atl_12_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_12_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2013.shtml"
atl_13_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_13_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2014.shtml"
atl_14_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_14_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2015.shtml"
atl_15_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_15_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2016.shtml"
atl_16_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_16_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
url_atl <- "https://www.baseball-reference.com/teams/ATL/2017.shtml"
atl_17_bat <- url_atl %>%
read_html() %>%
html_node("#team_batting") %>%
html_table()
atl_17_pitch <- url_atl %>%
read_html() %>%
html_node("#team_pitching") %>%
html_table()
Fortunately for us, all this data comes in pretty clean, but we still need to fix it up and make it pretty. For example, in all the dataframes, except for the one we imported as a CSV, repeat the attribute names, so we’ll get rid of that. Additionally, most of the attributes are integers, but because of that pesky last line, they are listed as characters. To make matters worse, this line appears in the middle of each frame at seemingly random locations. The following example fixes this for the 2008 Atlanta Braves. The rest are fixed at the end of the example.
# This command will remove any row with an attribute in the Rk column with the value Rk
atl_08_bat <- atl_08_bat[!(atl_08_bat$Rk=="Rk"),]
atl_08_pitch <- atl_08_pitch[!(atl_08_pitch$Rk=="Rk"),]
# This will set each column to the proper type. The sapply applies the function to the specified columns
atl_08_bat[, c(1,4:28)] <- sapply(atl_08_bat[, c(1,4:28)], as.numeric)
atl_08_pitch[, c(1,4:34)] <- sapply(atl_08_pitch[, c(1,4:34)], as.numeric)
head(atl_08_pitch,10)
## Rk Pos Name Age W L W-L% ERA G GS GF CG SHO SV IP H
## 1 1 SP Jair Jurrjens 22 13 10 0.565 3.68 31 31 0 0 0 0 188.1 188
## 2 2 SP Jorge Campillo 29 8 7 0.533 3.91 39 25 3 1 0 0 158.2 158
## 3 3 SP Tim Hudson 32 11 7 0.611 3.17 23 22 0 1 1 0 142.0 125
## 4 4 SP Jo-Jo Reyes* 23 3 11 0.214 5.81 23 22 0 0 0 0 113.0 134
## 5 5 SP Mike Hampton* 35 3 4 0.429 4.85 13 13 0 0 0 0 78.0 83
## 6 6 SP Charlie Morton 24 4 8 0.333 6.15 16 15 0 0 0 0 74.2 80
## 8 7 CL Mike Gonzalez* 30 0 3 0.000 4.28 36 0 29 0 0 14 33.2 26
## 9 8 RP Jeff Bennett 28 3 7 0.300 3.70 72 4 12 0 0 3 97.1 86
## 10 9 RP Blaine Boyer 26 2 6 0.250 5.88 76 0 18 0 0 1 72.0 73
## 11 10 RP Will Ohman* 30 4 1 0.800 3.68 83 0 16 0 0 1 58.2 51
## R ER HR BB IBB SO HBP BK WP BF ERA+ FIP WHIP H9 HR9 BB9 SO9
## 1 87 77 11 70 9 139 4 0 3 813 113 3.59 1.370 9.0 0.5 3.3 6.6
## 2 74 69 18 38 2 107 1 0 2 655 107 4.00 1.235 9.0 1.0 2.2 6.1
## 3 53 50 11 40 5 85 2 1 3 573 132 3.83 1.162 7.9 0.7 2.5 5.4
## 4 77 73 18 52 4 78 3 0 2 512 72 5.28 1.646 10.7 1.4 4.1 6.2
## 5 45 42 10 28 6 38 1 0 0 331 86 4.94 1.423 9.6 1.2 3.2 4.4
## 6 56 51 9 41 2 48 2 0 2 345 68 5.14 1.621 9.6 1.1 4.9 5.8
## 8 21 16 6 14 3 44 1 0 0 142 98 4.17 1.188 7.0 1.6 3.7 11.8
## 9 44 40 5 47 6 68 7 0 5 419 113 4.07 1.366 8.0 0.5 4.3 6.3
## 10 51 47 10 25 4 67 2 0 2 313 71 4.20 1.361 9.1 1.3 3.1 8.4
## 11 27 24 3 22 4 53 1 0 2 248 114 3.17 1.244 7.8 0.5 3.4 8.1
## SO/W
## 1 1.99
## 2 2.82
## 3 2.13
## 4 1.50
## 5 1.36
## 6 1.17
## 8 3.14
## 9 1.45
## 10 2.68
## 11 2.41
One other thing we need to do clean this data is with the names. For whatever reason, baseball-reference decided to use characters next to a person’s name to show what side of the plate they bat from. No symbol means they bat right handed, a * means they bat left handed and a # means they bat from both sides of the plate. This blends a little into our next section, data management, because we are going to create a new attribute called bat_side that will contain a single character: R for righties, L for lefties, and S for switch hitters. We will also get rid of that symbol on the person’s name. There is also one for the pitchers, but here we won’t have to deal with switch pitchers as Pat Vendette did not play for any of these teams on during this era.
This is a good opportunity to introduce regex, which is contained in the stringr package for R. Here’s a regex cheat sheet for R. By using regex, we can check through each string and parse out the important bits and discard the rest.
atl_08_bat <- atl_08_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_08_pitch <- atl_08_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
head(atl_08_bat,10)
## Rk Pos Name Age G PA AB R H 2B 3B HR RBI SB CS BB SO
## 1 1 C Brian McCann 24 145 573 509 68 153 42 1 23 87 5 0 57 64
## 2 2 1B Mark Teixeira 28 103 451 381 63 108 27 0 20 78 0 0 65 70
## 3 3 2B Kelly Johnson 26 150 614 547 86 157 39 6 12 69 11 6 52 113
## 4 4 SS Yunel Escobar 25 136 587 514 71 148 24 2 10 60 2 5 59 62
## 5 5 3B Chipper Jones 36 128 534 439 82 160 24 1 22 75 4 0 90 61
## 6 6 LF Gregor Blanco 24 144 519 430 52 108 14 4 1 38 13 5 74 99
## 7 7 CF Mark Kotsay 32 88 345 318 39 92 17 3 6 37 2 3 25 34
## 8 8 RF Jeff Francoeur 24 155 653 599 70 143 33 3 11 71 0 1 39 111
## 9 9 UT Omar Infante 26 96 348 317 45 93 24 3 3 40 0 1 22 44
## 10 10 IF Martin Prado 24 78 254 228 36 73 18 4 2 33 3 1 21 29
## BA OBP SLG OPS OPS+ TB GDP HBP SH SF IBB bat_side
## 1 0.301 0.373 0.523 0.896 135 266 17 4 0 3 4 L
## 2 0.283 0.390 0.512 0.902 137 195 13 3 0 2 9 S
## 3 0.287 0.349 0.446 0.795 109 244 3 2 9 4 2 L
## 4 0.288 0.366 0.401 0.766 103 206 24 5 7 2 4 R
## 5 0.364 0.470 0.574 1.044 176 252 13 1 0 4 16 S
## 6 0.251 0.366 0.309 0.676 82 133 3 6 6 3 2 L
## 7 0.289 0.340 0.418 0.758 100 133 13 0 1 1 2 L
## 8 0.239 0.294 0.359 0.653 72 215 18 10 0 4 5 R
## 9 0.293 0.338 0.416 0.755 99 132 4 2 2 5 2 R
## 10 0.320 0.377 0.461 0.838 121 105 3 1 2 2 0 R
So once you understand regex, things become pretty easy. Now we speed up the process, doing this for our 100 dataframes as seen below for your viewing pleasure.
atl_09_bat <- atl_09_bat[!(atl_09_bat$Rk=="Rk"),]
atl_09_pitch <- atl_09_pitch[!(atl_09_pitch$Rk=="Rk"),]
atl_09_bat[, c(1,4:28)] <- sapply(atl_09_bat[, c(1,4:28)], as.numeric)
atl_09_pitch[, c(1,4:34)] <- sapply(atl_09_pitch[, c(1,4:34)], as.numeric)
atl_09_bat <- atl_09_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_09_pitch <- atl_09_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_10_bat <- atl_10_bat[!(atl_10_bat$Rk=="Rk"),]
atl_10_pitch <- atl_10_pitch[!(atl_10_pitch$Rk=="Rk"),]
atl_10_bat[, c(1,4:28)] <- sapply(atl_10_bat[, c(1,4:28)], as.numeric)
atl_10_pitch[, c(1,4:34)] <- sapply(atl_10_pitch[, c(1,4:34)], as.numeric)
atl_10_bat <- atl_10_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_10_pitch <- atl_10_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_11_bat <- atl_11_bat[!(atl_11_bat$Rk=="Rk"),]
atl_11_pitch <- atl_11_pitch[!(atl_11_pitch$Rk=="Rk"),]
atl_11_bat[, c(1,4:28)] <- sapply(atl_11_bat[, c(1,4:28)], as.numeric)
atl_11_pitch[, c(1,4:34)] <- sapply(atl_11_pitch[, c(1,4:34)], as.numeric)
atl_11_bat <- atl_11_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_11_pitch <- atl_11_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_12_bat <- atl_12_bat[!(atl_12_bat$Rk=="Rk"),]
atl_12_pitch <- atl_12_pitch[!(atl_12_pitch$Rk=="Rk"),]
atl_12_bat[, c(1,4:28)] <- sapply(atl_12_bat[, c(1,4:28)], as.numeric)
atl_12_pitch[, c(1,4:34)] <- sapply(atl_12_pitch[, c(1,4:34)], as.numeric)
atl_12_bat <- atl_12_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_12_pitch <- atl_12_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_13_bat <- atl_13_bat[!(atl_13_bat$Rk=="Rk"),]
atl_13_pitch <- atl_13_pitch[!(atl_13_pitch$Rk=="Rk"),]
atl_13_bat[, c(1,4:28)] <- sapply(atl_13_bat[, c(1,4:28)], as.numeric)
atl_13_pitch[, c(1,4:34)] <- sapply(atl_13_pitch[, c(1,4:34)], as.numeric)
atl_13_bat <- atl_13_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_13_pitch <- atl_13_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_14_bat <- atl_14_bat[!(atl_14_bat$Rk=="Rk"),]
atl_14_pitch <- atl_14_pitch[!(atl_14_pitch$Rk=="Rk"),]
atl_14_bat[, c(1,4:28)] <- sapply(atl_14_bat[, c(1,4:28)], as.numeric)
atl_14_pitch[, c(1,4:34)] <- sapply(atl_14_pitch[, c(1,4:34)], as.numeric)
atl_14_bat <- atl_14_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_14_pitch <- atl_14_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_15_bat <- atl_15_bat[!(atl_15_bat$Rk=="Rk"),]
atl_15_pitch <- atl_15_pitch[!(atl_15_pitch$Rk=="Rk"),]
atl_15_bat[, c(1,4:28)] <- sapply(atl_15_bat[, c(1,4:28)], as.numeric)
atl_15_pitch[, c(1,4:34)] <- sapply(atl_15_pitch[, c(1,4:34)], as.numeric)
atl_15_bat <- atl_15_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_15_pitch <- atl_15_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_16_bat <- atl_16_bat[!(atl_16_bat$Rk=="Rk"),]
atl_16_pitch <- atl_10_pitch[!(atl_16_pitch$Rk=="Rk"),]
atl_16_bat[, c(1,4:28)] <- sapply(atl_16_bat[, c(1,4:28)], as.numeric)
atl_16_pitch[, c(1,4:34)] <- sapply(atl_16_pitch[, c(1,4:34)], as.numeric)
atl_16_bat <- atl_16_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_16_pitch <- atl_16_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_17_bat <- atl_10_bat[!(atl_17_bat$Rk=="Rk"),]
atl_17_pitch <- atl_10_pitch[!(atl_17_pitch$Rk=="Rk"),]
atl_17_bat[, c(1,4:28)] <- sapply(atl_17_bat[, c(1,4:28)], as.numeric)
atl_17_pitch[, c(1,4:34)] <- sapply(atl_17_pitch[, c(1,4:34)], as.numeric)
atl_17_bat <- atl_17_bat %>%
mutate(bat_side=ifelse(str_detect(Name,"\\*"),'L', ifelse(str_detect(Name,"#"),'S','R'))) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
atl_17_pitch <- atl_17_pitch %>%
mutate(pitch_side=ifelse(str_detect(Name,"\\*"),'L', 'R')) %>%
mutate(Name = str_extract(Name,"[[:alpha:]]+.?[[:alpha:]]+.? [[:alpha:]]+"))
The other teams’ code to parse the data is hidden just so this document isn’t obscenely long, but it was proccessed.
Ok, so now we have a metric ton of data that has been cleaned up, what can we do with it? Well, actually, we can do a lot. For example, we could calculate a player’s WAR, or Wins Above Replacement. WAR is a calculation of how good a player is over whatever Joe Schmoe would come up to replace him. It’s an incredibly useful number, but rather difficult to calculate. In fact, both baseball-reference and fangraphs calculate it in different ways. So instead, let’s keep things simple. First, let’s clean up our data.
If you notice, all of the dataframes have the last couple rows as some summary statistics. The final entry is all the attributes re-listed, but we already got rid of that. We’ll want to summary statistics, so let’s stash that away temporarily and see what we can do with the clean dataframe. We can use the slice command to show which ones we don’t want. As before, we’ll look at just one team, this time the 2008 Philadelphia Phillies.
phi_08_a <- slice(phi_08_bat,(nrow(phi_08_bat)-3))
phi_08_b <- slice(phi_08_pitch,(nrow(phi_08_pitch)-1))
phi_08 <- merge(phi_08_a,phi_08_b,by="Name",all=TRUE)
phi_08_bat <- slice(phi_08_bat, 1:(nrow(phi_08_bat)-4))
head(phi_08_bat,10)
## # A tibble: 10 x 29
## Rk Pos Name Age G PA AB R H `2B` `3B` HR
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.00 C Carl~ 29.0 117 373 320 47.0 70.0 14.0 0 4.00
## 2 2.00 1B Ryan~ 28.0 162 700 610 105 153 26.0 4.00 48.0
## 3 3.00 2B Chas~ 29.0 159 707 607 113 177 41.0 4.00 33.0
## 4 4.00 SS Jimm~ 29.0 137 625 556 76.0 154 38.0 9.00 11.0
## 5 5.00 3B Pedr~ 33.0 133 463 425 43.0 106 19.0 2.00 14.0
## 6 6.00 LF Pat ~ 31.0 157 645 536 74.0 134 33.0 3.00 33.0
## 7 7.00 CF Shan~ 27.0 146 627 570 102 167 30.0 8.00 14.0
## 8 8.00 RF Jays~ 29.0 134 482 418 73.0 114 16.0 3.00 24.0
## 9 9.00 RF Geof~ 33.0 115 322 293 27.0 72.0 16.0 0 9.00
## 10 10.0 C Chri~ 35.0 98.0 305 274 28.0 72.0 17.0 0 9.00
## # ... with 17 more variables: RBI <dbl>, SB <dbl>, CS <dbl>, BB <dbl>,
## # SO <dbl>, BA <dbl>, OBP <dbl>, SLG <dbl>, OPS <dbl>, `OPS+` <dbl>,
## # TB <dbl>, GDP <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>, IBB <dbl>,
## # bat_side <chr>
Alternatively, we can filter them out using filter
phi_08_pitch <- filter(phi_08_pitch, phi_08_pitch$Rk!='NA')
head(phi_08_pitch,10)
## Rk Pos Name Age W L W-L% ERA G GS GF CG SHO SV IP H
## 1 1 SP Cole Hamels 24 14 10 0.583 3.09 33 33 0 2 2 0 227.1 193
## 2 2 SP Jamie Moyer 45 16 7 0.696 3.71 33 33 0 0 0 0 196.1 199
## 3 3 SP Brett Myers 27 10 13 0.435 4.55 30 30 0 2 1 0 190.0 197
## 4 4 SP Kyle Kendrick 23 11 9 0.550 5.49 31 30 1 0 0 0 155.2 194
## 5 5 SP Adam Eaton 30 4 8 0.333 5.80 21 19 0 0 0 0 107.0 131
## 6 6 SP Joe Blanton 27 4 0 1.000 4.20 13 13 0 0 0 0 70.2 66
## 7 7 CL Brad Lidge 31 2 0 1.000 1.95 72 0 61 0 0 41 69.1 50
## 8 8 RP Chad Durbin 30 5 4 0.556 2.87 71 0 12 0 0 1 87.2 81
## 9 9 RP Ryan Madson 27 4 2 0.667 3.05 76 0 14 0 0 1 82.2 79
## 10 10 RP Clay Condrey 32 3 4 0.429 3.26 56 0 30 0 0 1 69.0 85
## R ER HR BB IBB SO HBP BK WP BF ERA+ FIP WHIP H9 HR9 BB9 SO9
## 1 89 78 28 53 7 196 1 0 0 914 141 3.72 1.082 7.6 1.1 2.1 7.8
## 2 85 81 20 62 4 123 11 0 3 841 117 4.32 1.329 9.1 0.9 2.8 5.6
## 3 103 96 29 65 6 163 6 0 2 817 96 4.52 1.379 9.3 1.4 3.1 7.7
## 4 103 95 23 57 2 68 14 1 4 722 79 5.55 1.612 11.2 1.3 3.3 3.9
## 5 71 69 15 44 5 57 6 0 2 478 75 5.29 1.636 11.0 1.3 3.7 4.8
## 6 36 33 10 31 0 49 3 0 1 305 104 5.03 1.373 8.4 1.3 3.9 6.2
## 7 17 15 2 35 4 92 1 0 5 292 224 2.41 1.226 6.5 0.3 4.5 11.9
## 8 33 28 5 35 7 63 4 0 3 365 152 3.77 1.323 8.3 0.5 3.6 6.5
## 9 29 28 6 23 4 67 1 1 2 340 143 3.33 1.234 8.6 0.7 2.5 7.3
## 10 26 25 6 19 8 34 2 0 1 303 134 4.19 1.507 11.1 0.8 2.5 4.4
## SO/W pitch_side
## 1 3.70 L
## 2 1.98 L
## 3 2.51 R
## 4 1.19 R
## 5 1.30 R
## 6 1.58 R
## 7 2.63 R
## 8 1.80 R
## 9 2.91 R
## 10 1.79 R
Say we decide that slugging percentage is useless, we can get rid of that using select
phi_08_bat <- select(phi_08_bat, -SLG)
head(phi_08_bat,10)
## # A tibble: 10 x 28
## Rk Pos Name Age G PA AB R H `2B` `3B` HR
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.00 C Carl~ 29.0 117 373 320 47.0 70.0 14.0 0 4.00
## 2 2.00 1B Ryan~ 28.0 162 700 610 105 153 26.0 4.00 48.0
## 3 3.00 2B Chas~ 29.0 159 707 607 113 177 41.0 4.00 33.0
## 4 4.00 SS Jimm~ 29.0 137 625 556 76.0 154 38.0 9.00 11.0
## 5 5.00 3B Pedr~ 33.0 133 463 425 43.0 106 19.0 2.00 14.0
## 6 6.00 LF Pat ~ 31.0 157 645 536 74.0 134 33.0 3.00 33.0
## 7 7.00 CF Shan~ 27.0 146 627 570 102 167 30.0 8.00 14.0
## 8 8.00 RF Jays~ 29.0 134 482 418 73.0 114 16.0 3.00 24.0
## 9 9.00 RF Geof~ 33.0 115 322 293 27.0 72.0 16.0 0 9.00
## 10 10.0 C Chri~ 35.0 98.0 305 274 28.0 72.0 17.0 0 9.00
## # ... with 16 more variables: RBI <dbl>, SB <dbl>, CS <dbl>, BB <dbl>,
## # SO <dbl>, BA <dbl>, OBP <dbl>, OPS <dbl>, `OPS+` <dbl>, TB <dbl>,
## # GDP <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>, IBB <dbl>, bat_side <chr>
It can also be used to select the attributes we find useful.
As it turns out, slugging percentage is actually a really useful stat, so we need to add it back in. Fortunately, we have enough data to be able to recalculate it. Slugging percentage is calculated as: \[ \frac{\text{# of Singles} \times 1 + \text{# of Doubles} \times 2 + \text{# of Triples} \times 3 + \text{# of Homeruns} \times 4}{\text{# of At-Bats}}\]
So we can insert in a way we’ve already seen:
phi_08_bat <- phi_08_bat %>%
mutate(SLG = ((H-`2B`-`3B`-HR)+`2B`*2+`3B`*3+HR*4)/AB)
head(phi_08_bat,10)
## # A tibble: 10 x 29
## Rk Pos Name Age G PA AB R H `2B` `3B` HR
## <dbl> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.00 C Carl~ 29.0 117 373 320 47.0 70.0 14.0 0 4.00
## 2 2.00 1B Ryan~ 28.0 162 700 610 105 153 26.0 4.00 48.0
## 3 3.00 2B Chas~ 29.0 159 707 607 113 177 41.0 4.00 33.0
## 4 4.00 SS Jimm~ 29.0 137 625 556 76.0 154 38.0 9.00 11.0
## 5 5.00 3B Pedr~ 33.0 133 463 425 43.0 106 19.0 2.00 14.0
## 6 6.00 LF Pat ~ 31.0 157 645 536 74.0 134 33.0 3.00 33.0
## 7 7.00 CF Shan~ 27.0 146 627 570 102 167 30.0 8.00 14.0
## 8 8.00 RF Jays~ 29.0 134 482 418 73.0 114 16.0 3.00 24.0
## 9 9.00 RF Geof~ 33.0 115 322 293 27.0 72.0 16.0 0 9.00
## 10 10.0 C Chri~ 35.0 98.0 305 274 28.0 72.0 17.0 0 9.00
## # ... with 17 more variables: RBI <dbl>, SB <dbl>, CS <dbl>, BB <dbl>,
## # SO <dbl>, BA <dbl>, OBP <dbl>, OPS <dbl>, `OPS+` <dbl>, TB <dbl>,
## # GDP <dbl>, HBP <dbl>, SH <dbl>, SF <dbl>, IBB <dbl>, bat_side <chr>,
## # SLG <dbl>
There is a whole bunch of other things we can do, like renaming attributes, but what we are most interested in is the summary of each dataset. We start by merging datasets together by certain attributes. So instead of having two datasets per team per year, we can create one dataset for a team in a year.
phi_08_c <- merge(phi_08_bat,phi_08_pitch,by="Name",all=TRUE)
head(phi_08_c,10)
## Name Rk.x Pos.x Age.x G.x PA AB R.x H.x 2B 3B HR.x RBI SB
## 1 Adam Eaton 27 P 30 22 36 28 1 5 2 0 0 1 0
## 2 Andy Tracy 21 34 4 4 2 0 0 0 0 0 1 0
## 3 Brad Harman 16 IF 22 6 11 10 1 1 1 0 0 1 0
## 4 Brad Lidge 38 P 31 71 0 0 0 0 0 0 0 0 0
## 5 Brett Myers 24 P 27 30 71 58 3 4 1 0 0 1 0
## 6 Carlos Ruiz 1 C 29 117 373 320 47 70 14 0 4 31 1
## 7 Chad Durbin 29 P 30 69 9 9 0 1 0 0 0 0 0
## 8 Chase Utley 3 2B 29 159 707 607 113 177 41 4 33 104 14
## 9 Chris Coste 10 C 35 98 305 274 28 72 17 0 9 36 0
## 10 Chris Snelling 22 26 4 4 4 1 2 1 0 1 1 0
## CS BB.x SO.x BA OBP OPS OPS+ TB GDP HBP.x SH SF IBB.x bat_side
## 1 0 5 6 0.179 0.303 0.553 47 7 0 0 3 0 0 R
## 2 0 1 1 0.000 0.250 0.250 -27 0 0 0 0 1 0 L
## 3 0 1 1 0.100 0.182 0.382 0 2 1 0 0 0 0 R
## 4 0 0 0 NA NA NA NA 0 0 0 0 0 0 R
## 5 0 4 19 0.069 0.129 0.215 -42 5 1 0 9 0 0 R
## 6 2 44 38 0.219 0.320 0.620 63 96 14 4 4 1 6 R
## 7 0 0 3 0.111 0.111 0.222 -42 1 0 0 0 0 0 R
## 8 2 64 104 0.292 0.380 0.915 136 325 9 27 1 8 14 L
## 9 1 16 51 0.263 0.325 0.748 93 116 7 10 3 2 1 R
## 10 0 0 0 0.500 0.500 2.000 395 6 0 0 0 0 0 L
## SLG Rk.y Pos.y Age.y W L W-L% ERA G.y GS GF CG SHO SV IP
## 1 0.2500000 5 SP 30 4 8 0.333 5.80 21 19 0 0 0 0 107.0
## 2 0.0000000 NA <NA> NA NA NA NA NA NA NA NA NA NA NA NA
## 3 0.2000000 NA <NA> NA NA NA NA NA NA NA NA NA NA NA NA
## 4 NaN 7 CL 31 2 0 1.000 1.95 72 0 61 0 0 41 69.1
## 5 0.0862069 3 SP 27 10 13 0.435 4.55 30 30 0 2 1 0 190.0
## 6 0.3000000 NA <NA> NA NA NA NA NA NA NA NA NA NA NA NA
## 7 0.1111111 8 RP 30 5 4 0.556 2.87 71 0 12 0 0 1 87.2
## 8 0.5354201 NA <NA> NA NA NA NA NA NA NA NA NA NA NA NA
## 9 0.4233577 NA <NA> NA NA NA NA NA NA NA NA NA NA NA NA
## 10 1.5000000 NA <NA> NA NA NA NA NA NA NA NA NA NA NA NA
## H.y R.y ER HR.y BB.y IBB.y SO.y HBP.y BK WP BF ERA+ FIP WHIP H9
## 1 131 71 69 15 44 5 57 6 0 2 478 75 5.29 1.636 11.0
## 2 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 4 50 17 15 2 35 4 92 1 0 5 292 224 2.41 1.226 6.5
## 5 197 103 96 29 65 6 163 6 0 2 817 96 4.52 1.379 9.3
## 6 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 7 81 33 28 5 35 7 63 4 0 3 365 152 3.77 1.323 8.3
## 8 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 10 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## HR9 BB9 SO9 SO/W pitch_side
## 1 1.3 3.7 4.8 1.30 R
## 2 NA NA NA NA <NA>
## 3 NA NA NA NA <NA>
## 4 0.3 4.5 11.9 2.63 R
## 5 1.4 3.1 7.7 2.51 R
## 6 NA NA NA NA <NA>
## 7 0.5 3.6 6.5 1.80 R
## 8 NA NA NA NA <NA>
## 9 NA NA NA NA <NA>
## 10 NA NA NA NA <NA>
Unfortunately, this gives us a bunch of diplicate attributes, but this won’t matter because we will just ignore them for our next step: summarizing. R makes summarizing really easy, as we show below. But first, there are some missing entries from the merge, we fill those in with zeros.
# Before we can summarize, we must do some cleaning.
phi_08_c[is.na(phi_08_c)] <- 0
# Now we can summarize
phi_08_c <- summarize(phi_08, Age=mean(Age.x), PA=sum(PA), AB=sum(AB), RS=sum(R.x), H=sum(H.x), Doubles=sum(`2B`), Triples=sum(`3B`), HR=sum(HR.x), BB=sum(BB.x), Batting_SO=sum(SO.x), W=sum(W), L=sum(L), IP=sum(IP), H_allowed=sum(H.y), RA=sum(ER), Pitching_SO=sum(SO.y), BB_allowed=sum(BB.y))
phi_08_c <- phi_08_c %>%
mutate(Team = "PHI") %>%
mutate(Year = 2008)
head(phi_08_c)
## Age PA AB RS H Doubles Triples HR BB Batting_SO W L IP
## 1 30 6273 5509 799 1407 291 36 214 586 1117 92 70 1449.2
## H_allowed RA Pitching_SO BB_allowed Team Year
## 1 1444 625 1081 533 PHI 2008
Now let’s compare to the original summary given by baseball-reference
head(phi_08)
## Name Rk.x Pos.x Age.x G.x PA AB R.x H.x 2B 3B HR.x RBI SB
## 1 Team Totals NA 30 162 6273 5509 799 1407 291 36 214 762 136
## CS BB.x SO.x BA OBP SLG OPS OPS+ TB GDP HBP.x SH SF IBB.x
## 1 25 586 1117 0.255 0.332 0.438 0.77 99 2412 108 67 71 40 68
## bat_side Rk.y Pos.y Age.y W L W-L% ERA G.y GS GF CG SHO SV IP
## 1 R NA 30.3 92 70 0.568 3.88 162 162 158 4 3 47 1449.2
## H.y R.y ER HR.y BB.y IBB.y SO.y HBP.y BK WP BF ERA+ FIP WHIP H9 HR9
## 1 1444 680 625 160 533 64 1081 57 3 34 6229 112 4.3 1.364 9 1
## BB9 SO9 SO/W pitch_side
## 1 3.3 6.7 2.03 R
Now we can properly set up the rest of the Philly dataframes:
phi_08 <- select(phi_08, c(4:12,16:21,33:36,45,46,57))
phi_08 <- phi_08 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2008)
phi_09 <- merge(filter(phi_09_bat,Name=="Team Totals"),filter(phi_09_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_09 <- select(phi_09, c(4:12,16:21,33:36,45,46,57))
phi_09 <- phi_09 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2009)
phi_10 <- merge(filter(phi_10_bat,Name=="Team Totals"),filter(phi_10_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_10 <- select(phi_10, c(4:12,16:21,33:36,45,46,57))
phi_10 <- phi_10 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2010)
phi_11 <- merge(filter(phi_11_bat,Name=="Team Totals"),filter(phi_11_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_11 <- select(phi_11, c(4:12,16:21,33:36,45,46,57))
phi_11 <- phi_11 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2011)
phi_12 <- merge(filter(phi_12_bat,Name=="Team Totals"),filter(phi_12_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_12 <- select(phi_12, c(4:12,16:21,33:36,45,46,57))
phi_12 <- phi_12 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2012)
phi_13 <- merge(filter(phi_13_bat,Name=="Team Totals"),filter(phi_13_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_13 <- select(phi_13, c(4:12,16:21,33:36,45,46,57))
phi_13 <- phi_13 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2013)
phi_14 <- merge(filter(phi_14_bat,Name=="Team Totals"),filter(phi_14_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_14 <- select(phi_14, c(4:12,16:21,33:36,45,46,57))
phi_14 <- phi_14 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2014)
phi_15 <- merge(filter(phi_15_bat,Name=="Team Totals"),filter(phi_15_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_15 <- select(phi_15, c(4:12,16:21,33:36,45,46,57))
phi_15 <- phi_15 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2015)
phi_16 <- merge(filter(phi_16_bat,Name=="Team Totals"),filter(phi_16_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_16 <- select(phi_16, c(4:12,16:21,33:36,45,46,57))
phi_16 <- phi_16 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2016)
phi_17 <- merge(filter(phi_17_bat,Name=="Team Totals"),filter(phi_17_pitch,Name=="Team Totals"),by="Name",all=TRUE)
phi_17 <- select(phi_17, c(4:12,16:21,33:36,45,46,57))
phi_17 <- phi_17 %>%
mutate(Team = "PHI") %>%
mutate(Year = 2017)
Once again, we’ll hide the other dataframe calculations for berevity.
Now that that has been done, let’s merge the summaries into one final dataframe. We also add in some of the more interesting averages that we couldn’t do with summary because of weighting.
sum_df <- Reduce(function(x, y) merge(x, y, all=TRUE), list(phi_08, phi_09, phi_10, phi_11, phi_12, phi_13, phi_14, phi_15, phi_16, phi_17, atl_08, atl_09, atl_10, atl_11, atl_12, atl_13, atl_14, atl_15, atl_16, atl_17, chc_08, chc_09, chc_10, chc_11, chc_12, chc_13, chc_14, chc_15, chc_15, chc_16, chc_17, hou_08, hou_09, hou_10, hou_11, hou_12, hou_13, hou_14, hou_15, hou_16, hou_17, nyy_08, nyy_09, nyy_10, nyy_11, nyy_12, nyy_13, nyy_14, nyy_15, nyy_16, nyy_17)) %>%
rename(Age=Age.x) %>%
rename(G=G.x) %>%
rename(RS=R.x) %>%
rename(H=H.x) %>%
rename(HR=HR.x) %>%
rename(BB=BB.x) %>%
rename(Batting_SO=SO.x) %>%
rename(Winning_pct=`W-L%`) %>%
rename(RA=R.y) %>%
rename(Doubles=`2B`) %>%
rename(Triples=`3B`)
head(sum_df, 10)
## Age G PA AB RS H Doubles Triples HR BB Batting_SO BA
## 1 25.5 162 6055 5447 629 1317 240 19 163 495 1442 0.242
## 2 25.9 162 6020 5457 610 1307 266 16 148 426 1535 0.240
## 3 26.3 162 6073 5459 729 1363 278 26 230 486 1392 0.250
## 4 26.5 162 6133 5535 690 1382 287 36 174 494 1417 0.250
## 5 26.5 162 6204 5545 724 1367 291 29 198 554 1452 0.247
## 6 26.6 162 6014 5407 583 1276 238 28 146 463 1365 0.236
## 7 26.7 162 6102 5508 614 1315 270 31 157 442 1477 0.239
## 8 26.7 162 6200 5491 689 1341 272 30 171 567 1518 0.244
## 9 26.8 162 6064 5468 573 1316 240 22 123 472 1369 0.241
## 10 26.8 162 6133 5441 688 1354 247 21 181 542 1384 0.249
## OBP SLG OPS W L Winning_pct ERA RA ER WHIP Team Year
## 1 0.309 0.383 0.692 70 92 0.432 4.11 723 657 1.335 HOU 2014
## 2 0.299 0.375 0.674 51 111 0.315 4.79 848 766 1.490 HOU 2013
## 3 0.315 0.437 0.752 86 76 0.531 3.57 618 572 1.201 HOU 2015
## 4 0.315 0.409 0.723 66 96 0.407 4.55 782 729 1.387 PHI 2017
## 5 0.319 0.417 0.735 76 86 0.469 4.09 729 654 1.385 HOU 2016
## 6 0.302 0.371 0.673 55 107 0.340 4.56 794 721 1.428 HOU 2012
## 7 0.300 0.385 0.684 73 89 0.451 3.91 707 636 1.300 CHC 2014
## 8 0.321 0.398 0.719 97 65 0.599 3.36 608 546 1.152 CHC 2015
## 9 0.305 0.360 0.665 79 83 0.488 3.38 597 547 1.265 ATL 2014
## 10 0.321 0.402 0.723 96 66 0.593 3.18 548 512 1.196 ATL 2013
Could we have gotten to this point in an easier fashion? Sure. We could have pulled the summary statistics for each team across all years from baseball-references website, but now we have an idea as to what can go into these kinds of projects.
Now that we have all this data that we’ve reduced down to, what does it mean? Well, as it turns out, there was some data loss in all that transformation, but we’re going to roll with it, as there is still enough data to make a point. It could mean a whole lot of things, and one of the easiest ways is to plot it. R makes plotting fairly straightforward with ggplot, so let’s get down to it and see how age and winning percentage align.
ggplot(sum_df, aes(x=Year, y=Winning_pct, colour=Team)) + geom_point() + geom_line() +
scale_color_manual(values=c("darkblue","blue","orange","black","red"))
ggplot(sum_df, aes(x=Year, y=Age, colour=Team)) + geom_point() + geom_line() +
scale_color_manual(values=c("darkblue","blue","orange","black","red"))
ggplot(sum_df, aes(x=Age, y=Winning_pct, colour=Team)) + geom_point() +
scale_color_manual(values=c("darkblue","blue","orange","black","red"))
Now, we can get a little fancy, creating a 3d graph using plotly
plot_ly(sum_df, x = ~Year, y = ~Age, z = ~Winning_pct, color = ~Team, colors = c("darkblue","blue","orange","black","red")) %>%
add_markers() %>%
layout(scene = list(xaxis = list(title = 'Year'),
yaxis = list(title = 'Age'),
zaxis = list(title = 'Winning Percentage')))
The Age vs Winning Percentage Graph tells a lot, if you know some of the numbers behind the data. There are four main spikes in winning percentage, one at roughly 27, one at 28.7, one at 30.7, and one at 31.7. This seems to disprove our original idea that young teams are the way to go. However, consider the first two spikes. The tend is generally linear, so the first spike is rather strange. It’s where teams are first starting to get that young talent in the majors. With a few years experience, and some added veterens, that average age will go up, but so will the wins.
One other thing we can discuss about this data is the central tendency, spread, skew, and outliers. Central tendency is the data averages to. For example, most of the ages in our original datasets averaged to roughly 28 years old. The spread is the range that this data is between. For example, in 2018 Ichiro Suzuki is an impressive 44 years old. Spread also relates to where the data lies, most players tend to be in the 27 to 30 range. Speaking of Ronald Acuna and Ichiro Suzuki, both are examples of outliers. Outliers are data points that are significantly outside the expected range. Almost all players won’t make it to their age 40 season, let alone their 44th. Skew is whether that data tends to lean to one side of the average or the other.
The best way to see these statistical trends is through boxplots. Below, we show boxplots for each team and their average age in each year.
ggplot(sum_df,aes(x=Team, y=Age,group=Team)) +
geom_boxplot() + ylab("Age") + xlab("Team")
Here we notice some interesting things about each team. Houston has the largest spread, which makes sense. Houston, over the past two or three years, has been going out to get veteran players to help complete their team. Meanwhile, the Yankees have the lowest spread, and the highest central tendency. More than the other teams here, they have the money to go out and get top players, who are typically older. The Phillies, who have the second highest spread, are also significantly skewed. Most of their players have been in the league a long time, so most of the players are pretty old (as far as player standards go). Then, as they rebuilt, they went out and got veteran players who weren’t top tier, but might have a breakout season, keeping that average age up. In more recent years, the Phillies have called up their young players, increasing the spread.
Pythagorean Win Loss is one of the more interesting statistics that sabermatricians have come up with. Given the amount of runs scored for and against a team, it will compute what their expected winning percentage will be. In this example, we’ll do some hypothesis testing to see how valid Pythagorean Win Loss is against our dataset.
First, we have to add the expected winning percentage to our dataset. Pythagorean Win Loss is calculated as \[\text{win}\% = \frac{RS^2}{RS^2+RA^2}\] Depending on the sport, those exponents can change, tweaking the winning percentage to fit correctly.
sum_df <- mutate(sum_df, pyth=(RS^2)/((RS^2) + (RA^2)))
we can model winning percentage as a Bernoulli distribution where each game will give us one of two results: a win or a loss, with the odds being the winning percentage. So that means that \(\hat{p}\) is the winning percentage, which we can use to calculate the varience. From there, we can calculate the odds of the Pythagorean Win-Loss being outside of our winning percentage model. Normally, if looking for statistical improvement, you would want your predicted odds to be outside the realm of possible values for the original probability, but in our case, we want the odds to always be in that realm of possible values. If the odds are less than 5%, then we can reject the null hypothesis and say that the Pythagorean Win-Loss model is invalid.
hyp_df <- sum_df %>%
select(Team,Year,G,W,L,Winning_pct,pyth) %>%
mutate(winning_var = Winning_pct*(1-Winning_pct)/G) %>%
mutate(p = 1-pnorm(pyth,Winning_pct,winning_var^.5)) %>%
mutate(valid = ifelse(p > 0.95 | p < 0.05,FALSE,TRUE))
As we can see, that there is statisticially little chance of being outside the realm of possible values for winning percentage is very little. Only two values had winning percentages significantly outside their Pythagorean Win-Loss expected winning percentage. Both the 2016 Phillies and Yankees beat their expected Pythagorean Win-Loss. Thus, we can say with statistical certainty that Pythagorean Win-Loss is an accurate metric.
Machine Learning is a hot topic in computer science. Machine learning uses statistics to help a computer improve its performance by itself. For example, Elon Musk’s OpenAI company developed a computer that beat professional eSports players. Using “trial by error” techniques, the computer was able to teach itself how to play a simplified version of Dota 2, a game with a notoriously large learning curve. It completely shut down beloved pro player Dendi.
In our example, we’ll cut out 2016-2017 to be our test year and use the rest to train a random forest to predict if a team has a winning percentage. A random forest is a collection of decision trees that it uses to predict the outcome. Each decision tree takes the input and, like a binary search tree, sends that input further down the line until it reaches the final node, which will give the prediction and it’s likelyhood. For our example, we will use all the possible statistics except for year and team. Note that the function randomForest comes from the randomForest package.
#First, add conditional if the team had a winning record
final_df <- mutate(sum_df, winning_record=ifelse(Winning_pct>0.5,"YES","NO"))
final_df <- mutate(final_df, winning_record=factor(winning_record, levels=c("YES", "NO")))
final_df <- select(final_df, -c(Team,W,L,Winning_pct))
#then, we strip out the 2017 year
test_df <- filter(final_df,Year==2017|Year==2016)
final_df <- filter(final_df,Year!=2017|Year!=2016)
#now we create the random forest
rf <- randomForest(winning_record~., data = final_df %>% select(-Year))
#now we see how well our prediction did
test_predictions <- predict(rf, newdata=test_df %>% select(-Year))
#and print out our results
table(pred=test_predictions, observed=test_df$winning_record)
## observed
## pred YES NO
## YES 3 0
## NO 0 6
Looks like our random forest predictor is spot on. Every time it predicted a winning record, the team got a winning record, which is pretty good.
Sabermetrics - The study of baseball statistics to evaluate how well teams and players do
Farm System - A nickname for the collection of a franchise’s minor league teams. This is where young players develop their craft against other young players, and some veteran players.
The Show - Also called the Bigs, its another name for Major League Baseball
WAR - Wins Above Replacement, an arbitrary calculation to compare how good a team is compared to another.
Dataframe - An object/matrix that stores the data we are interested in. Each column contains a particular attribute for each row, or entry.
CSV - Comma Separated Values, a plaintext file where each value is separated by a comma and each entry separated by a newline character. It is an easy way to store data.
G - Number of Games played
PA - Plate Appearances, number of times a player has come to bat
AB - At-Bats, number of at bats a player has had. It’s different from plate appearances as walks don’t count towards at-bats
RS - Runs Scored, number of times a team has scored
H - Hits, number of times a player has reach at least first base with a ball put in play
2B - Doubles, number of times a player has reached second safely on a ball hit in play
3B - Triples, number of times a player has reached third safely on a ball hit in play
HR - Homeruns, number of times a player has rounded the bases by either hitting the ball over the outfield fence in fair territory, or by rounding all four bases on a single ball hit in play
BB - Batter Based, number of times a batter has walked. In hitting, its how many times the batter has walked; in pitching its how many times a pitcher has walked an opposing batter
SO - Strike Outs, also known as a K. In hitting, its the number of times a batter has struck out and in pitching its how many strikes a pitcher has collected
BA - Batting Average, also AVG, its the average number of times a player has reached at least first safely. Walks, errors, and fielder’s choices don’t count
OBP - On-Base Percentage, the average number of times a player has reached first either by a hit or by a walk
SLG - Slugging Percentage, weighted average each type of hit. Each one is weighed as such: singles by one, doubles by two, triples by three, and homeruns by four.
OPS - On-base Plus Slugging, pretty self explanatory
W - Wins, for a pitcher, its the number of times their team has taken the lead while they were pitching and not given the lead back up. For a team, its the total number of wins
L - Losses, like wins, for a pitcher, its the number of times they have given up the lead and their team has not gotten it back. For a team, its their total number of losses.
Winning Percentage - Pretty self explanatory, its the number of wins over the total number of games played
ERA - Earned Run Average, its the average number of runs a pitcher gives up in 9 innings of work
RA - Runs allowed
ER - Earned Runs Allowed, this subtracts runs given up because of errors
WHIP - Walks and Hits per Inning Pitched, the number of walks and hits a pitcher gives up on average per inning of work